{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# for use in tutorial and development; do not include this `sys.path` change in production:\n",
    "import sys ; sys.path.insert(0, \"../\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Explain PyTextRank: extractive summarization\n",
    "\n",
    "How does **PyTextRank** perform *extractive summarization* on a text document?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "First we perform some basic housekeeping for Jupyter, then load `spaCy` with a language model for English ..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import warnings\n",
    "warnings.filterwarnings(\"ignore\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "import spacy\n",
    "nlp = spacy.load(\"en_core_web_sm\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create some text to use...."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "text = \"Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types.\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then add **PyTextRank** into the `spaCy` pipeline..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pytextrank\n",
    "\n",
    "nlp.add_pipe(\"textrank\", last=True)\n",
    "doc = nlp(text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Examine the results: a list of top-ranked phrases in the document"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "scrolled": true,
    "tags": []
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "ic| p.rank: 0.18359439311764025, p.count: 1, p.text: 'mixed types'\n",
      "ic| p.chunks: [mixed types]\n",
      "ic| p.rank: 0.17847961931078207, p.count: 3, p.text: 'systems'\n",
      "ic| p.chunks: [systems, systems, systems]\n",
      "ic| p.rank: 0.15037838042245094\n",
      "    p.count: 1\n",
      "    p.text: 'minimal generating sets'\n",
      "ic| p.chunks: [minimal generating sets]\n",
      "ic| p.rank: 0.14740065982407316\n",
      "    p.count: 1\n",
      "    p.text: 'nonstrict inequations'\n",
      "ic| p.chunks: [nonstrict inequations]\n",
      "ic| p.rank: 0.13946027725597837\n",
      "    p.count: 1\n",
      "    p.text: 'strict inequations'\n",
      "ic| p.chunks: [strict inequations]\n",
      "ic| p.rank: 0.1195023546245721\n",
      "    p.count: 1\n",
      "    p.text: 'linear Diophantine equations'\n",
      "ic| p.chunks: [linear Diophantine equations]\n",
      "ic| p.rank: 0.11450088293222845, p.count: 1, p.text: 'natural numbers'\n",
      "ic| p.chunks: [natural numbers]\n",
      "ic| p.rank: 0.1078071817368632, p.count: 3, p.text: 'solutions'\n",
      "ic| p.chunks: [solutions, solutions, solutions]\n",
      "ic| p.rank: 0.10529828014583348\n",
      "    p.count: 1\n",
      "    p.text: 'linear constraints'\n",
      "ic| p.chunks: [linear constraints]\n",
      "ic| p.rank: 0.10369605907081418\n",
      "    p.count: 1\n",
      "    p.text: 'all the considered types systems'\n",
      "ic| p.chunks: [all the considered types systems]\n",
      "ic| p.rank: 0.08812713074893187\n",
      "    p.count: 1\n",
      "    p.text: 'a minimal supporting set'\n",
      "ic| p.chunks: [a minimal supporting set]\n",
      "ic| p.rank: 0.08243620500315357, p.count: 1, p.text: 'a system'\n",
      "ic| p.chunks: [a system]\n",
      "ic| p.rank: 0.07944607954086784, p.count: 1, p.text: 'a minimal set'\n",
      "ic| p.chunks: [a minimal set]\n",
      "ic| p.rank: 0.0763527926213032, p.count: 1, p.text: 'algorithms'\n",
      "ic| p.chunks: [algorithms]\n",
      "ic| p.rank: 0.07593126037016427, p.count: 1, p.text: 'all types'\n",
      "ic| p.chunks: [all types]\n",
      "ic| p.rank: 0.07309361902551356, p.count: 1, p.text: 'Diophantine'\n",
      "ic| p.chunks: [Diophantine]\n",
      "ic| p.rank: 0.0702090100898443, p.count: 1, p.text: 'construction'\n",
      "ic| p.chunks: [construction]\n",
      "ic| p.rank: 0.060225391238828516, p.count: 1, p.text: 'Upper bounds'\n",
      "ic| p.chunks: [Upper bounds]\n",
      "ic| p.rank: 0.05800111772673988, p.count: 1, p.text: 'the set'\n",
      "ic| p.chunks: [the set]\n",
      "ic| p.rank: 0.05425139476531647, p.count: 1, p.text: 'components'\n",
      "ic| p.chunks: [components]\n",
      "ic| p.rank: 0.04516904342912139, p.count: 1, p.text: 'Compatibility'\n",
      "ic| p.chunks: [Compatibility]\n",
      "ic| p.rank: 0.04516904342912139, p.count: 1, p.text: 'compatibility'\n",
      "ic| p.chunks: [compatibility]\n",
      "ic| p.rank: 0.04435648606848154\n",
      "    p.count: 1\n",
      "    p.text: 'the corresponding algorithms'\n",
      "ic| p.chunks: [the corresponding algorithms]\n",
      "ic| p.rank: 0.042273783712246285, p.count: 1, p.text: 'Criteria'\n",
      "ic| p.chunks: [Criteria]\n",
      "ic| p.rank: 0.01952542432474353, p.count: 1, p.text: 'These criteria'\n",
      "ic| p.chunks: [These criteria]\n"
     ]
    }
   ],
   "source": [
    "from icecream import ic\n",
    "\n",
    "for p in doc._.phrases:\n",
    "    ic(p.rank, p.count, p.text)\n",
    "    ic(p.chunks)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Construct a list of the sentence boundaries with a phrase vector (initialized to empty set) for each..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[[0, 13, set()], [13, 33, set()], [33, 61, set()], [61, 91, set()]]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sent_bounds = [ [s.start, s.end, set([])] for s in doc.sents ]\n",
    "sent_bounds"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Iterate through the top-ranked phrases, added them to the phrase vector for each sentence..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "scrolled": true,
    "tags": []
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "ic| phrase_id: 0, p.text: 'mixed types', p.rank: 0.18359439311764025\n",
      "ic| chunk.start: 88, chunk.end: 90\n",
      "ic| sent_start: 61, chunk.start: 88, chunk.end: 90, sent_end: 91\n",
      "ic| phrase_id: 1, p.text: 'systems', p.rank: 0.17847961931078207\n",
      "ic| chunk.start: 2, chunk.end: 3\n",
      "ic| sent_start: 0, chunk.start: 2, chunk.end: 3, sent_end: 13\n",
      "ic| chunk.start: 57, chunk.end: 58\n",
      "ic| sent_start: 33, chunk.start: 57, chunk.end: 58, sent_end: 61\n",
      "ic| chunk.start: 86, chunk.end: 87\n",
      "ic| sent_start: 61, chunk.start: 86, chunk.end: 87, sent_end: 91\n",
      "ic| phrase_id: 2\n",
      "    p.text: 'minimal generating sets'\n",
      "    p.rank: 0.15037838042245094\n",
      "ic| chunk.start: 48, chunk.end: 51\n",
      "ic| sent_start: 33, chunk.start: 48, chunk.end: 51, sent_end: 61\n",
      "ic| phrase_id: 3\n",
      "    p.text: 'nonstrict inequations'\n",
      "    p.rank: 0.14740065982407316\n",
      "ic| chunk.start: 28, chunk.end: 30\n",
      "ic| sent_start: 13, chunk.start: 28, chunk.end: 30, sent_end: 33\n"
     ]
    }
   ],
   "source": [
    "limit_phrases = 4\n",
    "\n",
    "phrase_id = 0\n",
    "unit_vector = []\n",
    "\n",
    "for p in doc._.phrases:\n",
    "    ic(phrase_id, p.text, p.rank)\n",
    "    \n",
    "    unit_vector.append(p.rank)\n",
    "    \n",
    "    for chunk in p.chunks:\n",
    "        ic(chunk.start, chunk.end)\n",
    "        \n",
    "        for sent_start, sent_end, sent_vector in sent_bounds:\n",
    "            if chunk.start >= sent_start and chunk.end <= sent_end:\n",
    "                ic(sent_start, chunk.start, chunk.end, sent_end)\n",
    "                sent_vector.add(phrase_id)\n",
    "                break\n",
    "\n",
    "    phrase_id += 1\n",
    "\n",
    "    if phrase_id == limit_phrases:\n",
    "        break"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's take a look at the results..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[[0, 13, {1}], [13, 33, {3}], [33, 61, {1, 2}], [61, 91, {0, 1}]]"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sent_bounds"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "ic| sent: Compatibility of systems of linear constraints over the set of natural numbers.\n",
      "ic| sent: Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered.\n",
      "ic| sent: Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given.\n",
      "ic| sent: These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types.\n"
     ]
    }
   ],
   "source": [
    "for sent in doc.sents:\n",
    "    ic(sent)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We also construct a `unit_vector` for all of the phrases, up to the limit requested..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[0.18359439311764025,\n",
       " 0.17847961931078207,\n",
       " 0.15037838042245094,\n",
       " 0.14740065982407316]"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "unit_vector"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then normalized..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[0.2782352712825618,\n",
       " 0.2704838881736656,\n",
       " 0.2278967715809441,\n",
       " 0.22338406896282853]"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sum_ranks = sum(unit_vector)\n",
    "\n",
    "unit_vector = [ rank/sum_ranks for rank in unit_vector ]\n",
    "unit_vector"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Iterate through each sentence, calculating its *euclidean distance* from the unit vector..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "scrolled": true,
    "tags": []
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "ic| sent_vector: {1}\n",
      "ic| phrase_id: 0, unit_vector[phrase_id]: 0.2782352712825618\n",
      "ic| phrase_id: 1, unit_vector[phrase_id]: 0.2704838881736656\n",
      "ic| phrase_id: 2, unit_vector[phrase_id]: 0.2278967715809441\n",
      "ic| phrase_id: 3, unit_vector[phrase_id]: 0.22338406896282853\n",
      "ic| sent_vector: {3}\n",
      "ic| phrase_id: 0, unit_vector[phrase_id]: 0.2782352712825618\n",
      "ic| phrase_id: 1, unit_vector[phrase_id]: 0.2704838881736656\n",
      "ic| phrase_id: 2, unit_vector[phrase_id]: 0.2278967715809441\n",
      "ic| phrase_id: 3, unit_vector[phrase_id]: 0.22338406896282853\n",
      "ic| sent_vector: {1, 2}\n",
      "ic| phrase_id: 0, unit_vector[phrase_id]: 0.2782352712825618\n",
      "ic| phrase_id: 1, unit_vector[phrase_id]: 0.2704838881736656\n",
      "ic| phrase_id: 2, unit_vector[phrase_id]: 0.2278967715809441\n",
      "ic| phrase_id: 3, unit_vector[phrase_id]: 0.22338406896282853\n",
      "ic| sent_vector: {0, 1}\n",
      "ic| phrase_id: 0, unit_vector[phrase_id]: 0.2782352712825618\n",
      "ic| phrase_id: 1, unit_vector[phrase_id]: 0.2704838881736656\n",
      "ic| phrase_id: 2, unit_vector[phrase_id]: 0.2278967715809441\n",
      "ic| phrase_id: 3, unit_vector[phrase_id]: 0.22338406896282853\n"
     ]
    }
   ],
   "source": [
    "from math import sqrt\n",
    "\n",
    "sent_rank = {}\n",
    "sent_id = 0\n",
    "\n",
    "for sent_start, sent_end, sent_vector in sent_bounds:\n",
    "    ic(sent_vector)\n",
    "    sum_sq = 0.0\n",
    "    ic\n",
    "    for phrase_id in range(len(unit_vector)):\n",
    "        ic(phrase_id, unit_vector[phrase_id])\n",
    "        \n",
    "        if phrase_id not in sent_vector:\n",
    "            sum_sq += unit_vector[phrase_id]**2.0\n",
    "\n",
    "    sent_rank[sent_id] = sqrt(sum_sq)\n",
    "    sent_id += 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "ic| sent_rank: {0: 0.4233819161809908,\n",
      "                1: 0.4500148202495578,\n",
      "                2: 0.3568127078063091,\n",
      "                3: 0.31911969660835215}\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{0: 0.4233819161809908,\n",
       " 1: 0.4500148202495578,\n",
       " 2: 0.3568127078063091,\n",
       " 3: 0.31911969660835215}"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ic(sent_rank)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Sort the sentence indexes in descending order"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(3, 0.31911969660835215),\n",
       " (2, 0.3568127078063091),\n",
       " (0, 0.4233819161809908),\n",
       " (1, 0.4500148202495578)]"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from operator import itemgetter\n",
    "\n",
    "sorted(sent_rank.items(), key=itemgetter(1)) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Extract the sentences with the lowest distance, up to the limit requested..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "scrolled": true,
    "tags": []
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "ic| sent_id: 3\n",
      "    sent_text[sent_id]: ('These criteria and the corresponding algorithms for constructing a minimal '\n",
      "                         'supporting set of solutions can be used in solving all the considered types '\n",
      "                         'systems and systems of mixed types.')\n",
      "ic| sent_id: 2\n",
      "    sent_text[sent_id]: ('Upper bounds for components of a minimal set of solutions and algorithms of '\n",
      "                         'construction of minimal generating sets of solutions for all types of '\n",
      "                         'systems are given.')\n"
     ]
    }
   ],
   "source": [
    "limit_sentences = 2\n",
    "\n",
    "sent_text = {}\n",
    "sent_id = 0\n",
    "\n",
    "for sent in doc.sents:\n",
    "    sent_text[sent_id] = sent.text\n",
    "    sent_id += 1\n",
    "\n",
    "num_sent = 0\n",
    "\n",
    "for sent_id, rank in sorted(sent_rank.items(), key=itemgetter(1)):\n",
    "    ic(sent_id, sent_text[sent_id])\n",
    "    num_sent += 1\n",
    "    \n",
    "    if num_sent == limit_sentences:\n",
    "        break"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
